(57) ABSTRACT

The present invention provides a method and apparatus for 
the lexical analysis of computer source code. The lexical 
analyzer is dynamically configured at runtime to recognize 
one or more reserved words or operators. Thus, the
analyzer has the ability to interact with multiple languages. 
In one or more embodiments of the present invention, the 
analyzer is instantiated by a host application, for example, 
the parser of a compiler. The host application adds a list of 
tokens to the analyzer that must be recognized. These tokens 
comprise at least a subset of the reserved words and opera- 
tors of the computer language. In one embodiment, the host 
application then queries the analyzer for the next token in the 
source code. In another embodiment, tokens are added 
during the query phase as needed. In a separate embodiment, 
tokens are dynamically removed from the analyzer as the
needs of the host application change. 
METHOD AND APPARATUS FOR DYNAMIC
CONFIGURATION OF A LEXICAL ANALYSIS 
PARSER 

BACKGROUND OF THE INVENTION 
[0001] 1. Field of the Invention 

[0002] The present invention relates to the field of com- 
puter software, and in particular to a lexical analyzer that can 
be configured at runtime to accept multiple languages. 

[0003] Sun, Sun Microsystems, the Sun logo, Solaris and
all Java-based trademarks and logos are trademarks or 
registered trademarks of Sun Microsystems, Inc. in the 
United States and other countries. All SPARC trademarks 
are used under license and are trademarks of SPARC Inter- 
national, Inc. in the United States and other countries. 
Products bearing SPARC trademarks are based upon an 
architecture developed by Sun Microsystems, Inc. 

[0004] 2. Background Art 

[0005] Computer software, which comprises one or more 
computer instructions, must be processed by a system 
known as a "compiler" before it can be executed by an
intended computing environment. More specifically, the 
software steps by which a human is able to give instructions 
to a computer must be transformed by the compiler into a 
machine readable form for execution by processing hardware
units. Thus, the function of a compiler is to transform
computer instructions existing in a first representation (i.e.,
one understandable by a human) to computer instructions
existing in a second representation (i.e., one understandable
by a machine).

[0006] One component of a compiler is called a lexical 
analyzer. The lexical analyzer scans the characters of the 
source code and divides them into tokens for use in later 
compilation steps. Current lexical analyzers are static, mean- 
ing they will only scan for tokens known at the time the 
lexical analyzer was made. Thus, each lexical analyzer is 
bound to a certain token set which cannot easily be changed. 
Before discussing this problem, an overview of a compiler 
is provided. 

Compiler 

[0007] FIG. 1 shows the steps taken by an ordinary 
compiler. As illustrated in FIG. 1, the compiler comprises a 
parser 101, a translator 103, and a code generator 105. The 
parser 101 receives input in the form of source files 100 
(e.g., C++ .cpp and .hpp files) and generates a high-level 
representation 102 of the source code. This high-level rep- 
resentation 102 may include, for example, a tokenized 
version of the source code file. The translator 103 receives 
the high level representation 102 and translates the opera- 
tions into an intermediate form 104 that describes the 
operations. The intermediate form 104 is transformed by 
code generation process 105 into executable code 106 
configured to run on a specific platform. 
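
The three stages of FIG. 1 can be illustrated with a minimal sketch. It assumes a toy source language of integers joined by "+" and "-", and the function names parse, translate and generate are hypothetical stand-ins for parser 101, translator 103 and code generator 105. It is an illustration under those assumptions, not the compiler described here.

```python
import re


def parse(source):
    """Parser stage: produce a tokenized high-level representation.

    Assumption: the toy language contains only integers, "+" and "-".
    Any other character is silently skipped in this sketch.
    """
    return re.findall(r"\d+|[+-]", source)


def translate(tokens):
    """Translator stage: turn tokens into an intermediate list of operations.

    Assumption: tokens alternate number, operator, number, and so on.
    """
    ops = [("push", int(tokens[0]))]
    for operator, number in zip(tokens[1::2], tokens[2::2]):
        ops.append(("push", int(number)))
        ops.append(("add",) if operator == "+" else ("sub",))
    return ops


def generate(ops):
    """Code generation stage: here the "executable code" is a Python closure."""
    def run():
        stack = []
        for op in ops:
            if op[0] == "push":
                stack.append(op[1])
            else:
                right, left = stack.pop(), stack.pop()
                stack.append(left + right if op[0] == "add" else left - right)
        return stack[0]
    return run


program = generate(translate(parse("7 + 5 - 2")))
result = program()
```

Each stage consumes only the output of the previous one, which is why a lexical analyzer can be swapped or reconfigured without touching the later stages.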

[0008] Compilers must parse source code to be able to 
translate it into object code. Parsing is often divided into 
lexical analysis and semantic parsing. 

Tokens 

[0009] Lexical analysis concentrates on dividing strings
into components, called tokens, based on punctuation and
other keys. Semantic parsing then attempts to determine the
meaning of the string. A token is a sequence of characters
that is treated as a unit in the grammar for a programming
language. Tokens are grouped into types. Each token type is
described by a pattern. A lexeme is the set of specific
characters from a source file that match a pattern. Each
language has its own token types, patterns and lexemes.

[0010] Token types include numbers, string literals,
identifiers, character constants, reserved words (or keywords)
and operators. Keywords are sequences of letters and
possibly other characters that are reserved to the language.
Common examples are "while", "if" and "return". Each
keyword is a token. Operators are character sequences
consisting of non-alphanumeric characters and are used by
the language to represent operations. An operator may have
one or more characters and must be unique. Examples
include "(". Like the keyword token type, each
operator is a token.

[0011] Each token pattern defines a language. Thus, the 
language for numbers is the set of all strings consisting only 
of the digits 0 through 9. The language for the reserved
word "if" consists of the single string "if".
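
The relationship between token types, patterns and lexemes described above can be sketched as follows. The table TOKEN_PATTERNS, the function classify and the specific regular expressions are assumptions chosen for illustration; a lexeme belongs to the language of a pattern exactly when the whole lexeme matches that pattern.

```python
import re

# Hypothetical token types and patterns, checked in order so that
# reserved words take priority over ordinary identifiers.
TOKEN_PATTERNS = [
    ("KEYWORD", r"while|if|return"),
    ("NUMBER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR", r"[^A-Za-z0-9_\s]+"),
]


def classify(lexeme):
    """Return the first token type whose pattern matches the whole lexeme."""
    for token_type, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token_type
    return None  # the lexeme is in none of the defined languages


kinds = [classify(lexeme) for lexeme in ["if", "iffy", "2024", "("]]
```

Under these assumed patterns, "if" is classified as a keyword while "iffy" is an identifier, because the language for the reserved word contains only the single string "if".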

[0012] Certain source code structures do not constitute 
tokens. For example, comments, pre-processor directives, 
and spaces do not constitute tokens. 
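
A minimal sketch of discarding such non-token structures follows. It assumes C-style source in which pre-processor directives begin with "#" and comments begin with "//"; the function name strip_non_tokens is hypothetical. The sketch is deliberately naive: it would also strip "//" that appears inside a string literal.

```python
import re


def strip_non_tokens(source):
    """Return whitespace-separated lexemes with non-token structures removed."""
    kept = []
    for line in source.splitlines():
        if line.lstrip().startswith("#"):
            continue  # pre-processor directive: not a token
        line = re.sub(r"//.*", "", line)  # line comment: not a token
        kept.extend(line.split())  # spaces only separate lexemes
    return kept


lexemes = strip_non_tokens("#include <x.h>\nreturn 0 ; // done\n")
```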

[0013] The token set is critical because it defines the 
operations comprising a computer program. Each programming
language has a unique set of tokens. As such, each
programming language requires a unique lexical analyzer.

Lexical Analysis 

[0014] Lexical analyzers are typically subroutines of pars- 
ers. The parser invokes the lexical analyzer when it needs to 
examine the next token in a sequence. When the lexical 
analyzer is invoked, it reads input characters until it reaches 
the next token. 
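
The on-demand relationship between a parser and its lexical analyzer can be sketched as below. The class SimpleLexer and the function collect_tokens are hypothetical, and the sketch assumes that tokens are separated by whitespace, which is a simplification of real languages.

```python
class SimpleLexer:
    """Hypothetical analyzer that hands out one token per request."""

    def __init__(self, source):
        self._source = source
        self._pos = 0

    def next_token(self):
        """Read input characters until the next token is complete.

        Returns None when the input is exhausted.
        """
        src = self._source
        while self._pos < len(src) and src[self._pos].isspace():
            self._pos += 1
        if self._pos >= len(src):
            return None
        start = self._pos
        while self._pos < len(src) and not src[self._pos].isspace():
            self._pos += 1
        return src[start:self._pos]


def collect_tokens(source):
    """Stand-in for a parser: invoke the lexer each time a token is needed."""
    lexer = SimpleLexer(source)
    tokens = []
    while True:
        token = lexer.next_token()
        if token is None:
            break
        tokens.append(token)
    return tokens
```

The lexer keeps only a position in the input, so the caller decides when each token is read.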

[0015] An example of a lexical analyzer is called Lex. 
Using Lex, a separate file containing definitions, analyzer 
rules and user subroutines must be written before source 
code can be analyzed by Lex. 

[0016] Thus, Lex is a static program that is either gener- 
ated by a tool to understand certain tokens or is programmed 
by hand. There is no way to instruct a lexical analyzer at 
runtime to understand new or added tokens in different 
languages. This approach is problematic because tokens can 
only be added by modifying the source code for the analyzer. 
This process is slow, prone to error and expensive. 

SUMMARY OF THE INVENTION

[0017] The present invention provides a method and appa- 
ratus for the dynamically configurable lexical analysis of 
computer source code. The lexical analyzer is dynamically 
configured at runtime to recognize one or more reserved
words or operators. Thus, the lexical analyzer has the ability
to interact with multiple languages without being rewritten
from scratch.

[0018] In one or more embodiments of the present invention,
the analyzer is instantiated by a host application, for
example, the parser of a compiler. The host application adds 
a list of tokens to the analyzer that must be recognized. 
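
One possible shape for such a runtime-configurable analyzer is sketched below. It is an illustration only: the class name DynamicLexer, the methods add_token, remove_token and next_token, the longest-match rule, the word-boundary rule and the "UNKNOWN" fallback are all assumptions, not the interface of the invention. Following the terminology of the claims, each dictionary entry pairs a language descriptor (a reserved word or operator) with a token value.

```python
def _is_word_char(ch):
    return ch.isalnum() or ch == "_"


class DynamicLexer:
    """Hypothetical analyzer whose token dictionary can change at runtime."""

    def __init__(self, source):
        self._source = source
        self._pos = 0
        self._dictionary = {}  # language descriptor -> token value

    def add_token(self, descriptor, token_value):
        if not descriptor:
            raise ValueError("descriptor must be a non-empty string")
        self._dictionary[descriptor] = token_value

    def remove_token(self, descriptor):
        self._dictionary.pop(descriptor, None)

    def _matches_here(self, descriptor):
        src, end = self._source, self._pos + len(descriptor)
        if not src.startswith(descriptor, self._pos):
            return False
        # Assumed rule: a word-like descriptor such as "if" must not
        # match inside a longer word such as "iffy".
        if _is_word_char(descriptor[-1]) and end < len(src) and _is_word_char(src[end]):
            return False
        return True

    def next_token(self):
        """Return (token value, lexeme) for the next lexeme, or None at end."""
        src = self._source
        while self._pos < len(src) and src[self._pos].isspace():
            self._pos += 1
        if self._pos >= len(src):
            return None
        candidates = [d for d in self._dictionary if self._matches_here(d)]
        if candidates:
            lexeme = max(candidates, key=len)  # assumed longest-match rule
            self._pos += len(lexeme)
            return (self._dictionary[lexeme], lexeme)
        # Fallback for lexemes the host has not registered.
        start = self._pos
        if _is_word_char(src[start]):
            while self._pos < len(src) and _is_word_char(src[self._pos]):
                self._pos += 1
        else:
            self._pos += 1
        return ("UNKNOWN", src[start:self._pos])


# Host application: configure, query, then reconfigure during the query phase.
lexer = DynamicLexer("if x == 1 unless if")
lexer.add_token("if", "IF")
lexer.add_token("=", "ASSIGN")
lexer.add_token("==", "EQ")
first_four = [lexer.next_token() for _ in range(4)]
lexer.add_token("unless", "UNLESS")   # added while scanning
fifth = lexer.next_token()
lexer.remove_token("if")              # removed while scanning
sixth = lexer.next_token()
```

Because the dictionary is ordinary data held by the analyzer instance, the same code can serve a second language by registering a different set of entries rather than by regenerating the analyzer.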
[...] a system bus 718. The keyboard and mouse are for
introducing user input to the computer system and
communicating that user input to processor 713. Other
suitable input devices may be used in addition to, or in
place of, the mouse and keyboard. [...] system bus 718 [...]
I/O, etc.

[...] types of information.

[0042] Network link 721 typically provides data communication
through one or more networks to other devices.
For example, network link 721 may provide a connection
through local network 722 to host computer 723 or
to data equipment operated by ISP 724. ISP 724 in turn
provides data communication services through the world
wide packet data communication network now commonly
referred to as the "Internet" 725. Local network 722 and
Internet 725 both use electrical, electromagnetic or optical
signals which carry digital data streams. [...] of carrier
waves transporting the information.

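[0043] Processor 713 may reside wholly on computer 701
or wholly on server 726, or processor 713 may
have its computational power distributed between computer
701 and server 726. Server 726 symbolically is represented
in FIG. 7 as one unit, but server 726 can also be distributed
between multiple tiers. In one embodiment, server 726
comprises a middle and back tier where application logic
executes in the middle tier and persistent data is obtained in
the back tier. In the case where processor 713 resides wholly
on server 726, the results of the computations performed by
processor 713 are transmitted to computer 701 via Internet
725, Internet Service Provider (ISP) 724, local network 722
and communication interface 720. In this way, computer 701
is able to display the results of the computation to a user in
the form of output.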

[0044] Computer 701 includes a video memory 714, main
memory 715 and mass storage 712, all coupled to
bi-directional system bus 718 along with keyboard 710, mouse
711 and processor 713. As with processor 713, in various
computing environments, main memory 715 and mass storage
712 can reside wholly on server 726 or computer 701,
or they may be distributed between the two. Examples of
systems where processor 713, main memory 715, and mass
storage 712 are distributed between computer 701 and server
726 include the thin-client computing [...] developed by
Sun Microsystems, Inc.

[0045] The mass storage 712 may include both fixed and
removable media, such as magnetic, optical or magnetic
optical storage systems or any other available mass storage
technology. Bus 718 may contain, for example, thirty-two
address lines for addressing video memory 714 or main
memory 715. The system bus 718 also includes, for
example, a 32-bit data bus for transferring data between and
among the components, such as processor 713, main
memory 715, video memory 714 and mass storage 712.
Alternatively, multiplex data/address lines may be used
instead of separate data and address lines.

[0046] In one embodiment of the invention, the processor 
713 is a SPARC microprocessor from Sun Microsystems, 
Inc., a microprocessor manufactured by Motorola, such as
the 680X0 processor, or a microprocessor manufactured by 
Intel, such as the 80X86 or Pentium processor. However, any 
other suitable microprocessor or microcomputer may be 
utilized. Main memory 715 is comprised of dynamic random 
access memory (DRAM). Video memory 714 is a dual-ported
video random access memory. One port of the video
memory 714 is coupled to video amplifier 716. The video
amplifier 716 is used to drive the cathode ray tube (CRT)
raster monitor 717. Video amplifier 716 is well known in the
art and may be implemented by any suitable apparatus. This 
circuitry converts pixel data stored in video memory 714 to 
a raster signal suitable for use by monitor 717. Monitor 717 
is a type of monitor suitable for displaying graphic images. 

[0047] Computer 701 can send messages and receive data, 
including program code, through the network(s), network
link 721, and communication interface 720. In the Internet 
example, remote server computer 726 might transmit a 
requested code for an application program through Internet 
725, ISP 724, local network 722 and communication inter- 
face 720. The received code may be executed by processor 
713 as it is received, and/or stored in mass storage 712, or 
other non-volatile storage for later execution. In this manner, 
computer 700 may obtain application code in the form of a 
carrier wave. Alternatively, remote server computer 726 may 
execute applications using processor 713, and utilize mass
storage 712, and/or video memory 715. The results of the 
execution at server 726 are then transmitted through Internet 
725, ISP 724, local network 722 and communication inter- 
face 720. In this example, computer 701 performs only input 
and output functions.

[0048] Application code may be embodied in any form of
computer program product. A computer program product
comprises a medium configured to store or transport
computer readable code, or in which computer readable code
may be embedded. Some examples of computer program
products are CD-ROM disks, ROM cards, floppy disks, 
magnetic tapes, computer hard drives, servers on a network, 
and carrier waves. 

[0049] The computer systems described above are for 
purposes of example only. An embodiment of the invention 
may be implemented in any type of computer system or 
programming or processing environment. 

[0050] Thus, a dynamically configurable lexical analyzer
is described in conjunction with one or more specific 
embodiments. The invention is defined by the following 
claims and their full scope of equivalents.



1. A method for converting a source program into one or 
more tokens, comprising: 

obtaining one or more entries; 

analyzing said source program; and

generating said tokens from said source program, wherein
said entries may be used to generate a subset of said 
tokens. 

2. The method of claim 1 wherein said entries comprise
a language descriptor and a token value.

3. The method of claim 2 wherein the analyzing com- 
prises: 

obtaining a lexeme from said source program; and 

determining if said lexeme matches one of said language 
descriptors.

4. The method of claim 3 wherein the analyzing further 
comprises: 

obtaining said token value if said lexeme matches one of 
said language descriptors. 

5. The method of claim 4 wherein the analyzing further 
comprises: 

obtaining a next lexeme from said source program. 

6. The method of claim 5 wherein the generating com- 
prises: 

outputting said token value in response to a request from
a host program.

7. The method of claim 6 wherein said language descrip- 
tor is a reserved word. 

8. The method of claim 6 wherein said language descrip- 
tor is an operator. 

9. The method of claim 1 wherein the obtaining further 
comprises: 

entering said token entries into a token dictionary.

10. A computer program product comprising: 

a computer usable medium having computer readable 
program code embodied therein configured to convert 
source program into one or more tokens, said computer 
program product comprising: 

computer readable code configured to cause a computer 
to obtain one or more entries; 

computer readable code configured to cause a computer 
to analyze said source program; and 

computer readable code configured to cause a computer 
to generate said tokens from said source program, 
wherein said entries may be used to generate a subset
of said tokens. 

11. The computer program product of claim 10 wherein 
said entries comprise a language descriptor and a token 
value. 

12. The computer program product of claim 11 wherein 
said computer code configured to cause a computer to 
analyze the source program comprises: 

computer readable code configured to cause a computer to 
obtain a lexeme from said source program; and 

computer readable code configured to cause a computer to 
determine if said lexeme matches one of said language
descriptors. 

13. The computer program product of claim 12 wherein 
said computer code configured to cause a computer to 
analyze said source program further comprises: 

computer readable code configured to cause a computer to 
obtain said token value if said lexeme matches one of
said language descriptors. 
14. The computer program product of claim 13 wherein
said computer code configured to cause a computer to 
analyze said source program further comprises: 

computer readable code configured to cause a computer to 
obtain a next lexeme from said source program. 

15. The computer program product of claim 14 wherein 
said computer code configured to cause a computer to 
generate tokens comprises: 

computer readable code configured to cause a computer to 
output said token value in response to a request from a 
host program.

16. The computer program product of claim 15 wherein
said language descriptor is a reserved word. 

17. The computer program product of claim 15 wherein 
said language descriptor is an operator. 

18. The computer program product of claim 10 wherein 
said computer code configured to cause a computer to obtain 
one or more entries further comprises: 

computer readable code configured to cause a computer to 
enter said token entries into a token dictionary. 

19. A lexical analyzer, comprising: 

one or more entries configured to be obtained; 
a source program analyzer; 

one or more tokens configured to be generated from said 
source program analyzer, wherein said entries may be 
used to generate a subset of said tokens. 

20. The lexical analyzer of claim 19, wherein said entries 
comprise a language descriptor and a token value. 



21. The lexical analyzer of claim 20, wherein said source
program analyzer comprises: 

a source program interface, wherein said interface
obtains a lexeme from said source program; and

a lexeme comparator, wherein said comparator determines
whether said lexeme matches one of said language
descriptors.

22. The lexical analyzer of claim 21, wherein said source 
program analyzer further comprises: 

a token output interface, wherein said interface generates 
said token if said lexeme matches one of said language 
descriptors. 

23. The lexical analyzer of claim 22, wherein said source 
program interface further comprises: 

a source program manager, wherein said manager obtains 
a next lexeme from said source program. 

24. The lexical analyzer of claim 23, wherein said output 
interface comprises: 

a host program event handler, wherein said event handler 
causes said output interface to generate said token 
value in response to a request from the host program.

25. The lexical analyzer of claim 24, wherein said lan- 
guage descriptor is a reserved word. 

26. The lexical analyzer of claim 24, wherein said lan- 
guage descriptor is an operator. 

27. The lexical analyzer of claim 19, further comprising: 

a token dictionary, wherein said entries comprise dictio- 
nary entries. 

* * * * *



